fix(prover-node): await in-flight jobs and guard world-state on stop #23338
Draft
AztecBot wants to merge 1 commit into
Summary
Fixes the teardown SEGFAULT in CI log 92cbf66b564931cf (full analysis: original gist). The `e2e_fees/fee_settings` test body itself passed; the process died in `afterAll` because an `EpochProvingJob.run()` body was still calling into the native world-state addon after `ProverNode.stop()` returned.

Applies fixes #1 and #3 from that analysis. Fix #2 was skipped per request.

Fix #1: `ProverNode.stop()` awaits in-flight jobs
`yarn-project/prover-node/src/prover-node.ts`
- `startProof` used `void this.runJob(job)` (fire-and-forget) and `stop()` only awaited `job.stop()` for each tracked job.
- `job.stop()` only waits on the internal `runPromise` (set partway through `run()`), not the `runJob` wrapper's post-`run` work (`tryUploadEpochFailure`, `createProvingJob` on reorg); both touch world-state.
- `stop()` called `this.prover.stop()` first, which only cancelled in-flight proving requests; the orchestrator's `asyncPool` body kept creating forks and inserting L1→L2 messages.
- Track `runJob` promises in a `Map<string, Promise<void>>` keyed by `job.getId()` via a small `trackRunJob` helper, and have `stop()` signal `job.stop()` and await both the jobs and the `runJob` wrappers before stopping the prover/publisher/etc. (`Map<string, …>` rather than `Set<Promise<…>>` to satisfy the project's `aztec-custom/no-non-primitive-in-collections` lint rule.)

Fix #3: Defensive shutdown guard in `NativeWorldState`
`yarn-project/world-state/src/native/native_world_state_instance.ts`
- `close()` previously only drained the canonical (`forkId=0`) queue; in-flight per-fork-queue calls could race with the native CLOSE and segfault. The existing `assert.equal(this.open, true, ...)` was inside the queue's execute callback, so it didn't prevent new calls from being enqueued and only produced AssertionError-style failures.
- `call()` early check: if `!this.open`, throw `Native world state is closed; cannot call <MSG>` before any queue lookup.
- `close()` now drains every non-canonical per-fork queue (`Promise.all(queue.stop())`) before sending CLOSE on the canonical queue.

Worst case during shutdown is now a recognisable JS error, never a SIGSEGV.
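
To make the Fix #3 ordering concrete, here is a minimal, self-contained sketch of the guard-then-drain pattern. It is not the actual `NativeWorldState` code: `SerialQueue`, `GuardedInstance`, and the `log` array are hypothetical stand-ins, and the native work is simulated with a timer. It only illustrates the two ideas described above, rejecting calls once closed before any queue lookup, and draining the per-fork queues before CLOSE is sent on the canonical queue.

```typescript
// Hypothetical stand-in for a per-fork serial queue: jobs run one at a time,
// and stop() resolves once everything already enqueued has finished.
class SerialQueue {
  private tail: Promise<unknown> = Promise.resolve();

  execute<T>(fn: () => Promise<T>): Promise<T> {
    const result = this.tail.then(fn);
    // Keep the chain usable even if a job rejects.
    this.tail = result.catch(() => undefined);
    return result;
  }

  stop(): Promise<void> {
    return this.tail.then(() => undefined);
  }
}

// Hypothetical wrapper around a native handle; NOT the real NativeWorldState.
class GuardedInstance {
  private open = true;
  private queues = new Map<number, SerialQueue>([[0, new SerialQueue()]]);
  public log: string[] = [];

  async call(msg: string, forkId = 0): Promise<string> {
    // Early guard: fail with a plain JS error before any queue lookup.
    if (!this.open) {
      throw new Error(`Native world state is closed; cannot call ${msg}`);
    }
    let queue = this.queues.get(forkId);
    if (!queue) {
      queue = new SerialQueue();
      this.queues.set(forkId, queue);
    }
    return queue.execute(async () => {
      await new Promise(resolve => setTimeout(resolve, 5)); // simulated native work
      this.log.push(`${msg}@${forkId}`);
      return msg;
    });
  }

  async close(): Promise<void> {
    if (!this.open) {
      return;
    }
    this.open = false; // from here on, new calls are rejected by the guard
    // Drain every non-canonical (forkId !== 0) queue first...
    const forkQueues = Array.from(this.queues.entries())
      .filter(([id]) => id !== 0)
      .map(([, queue]) => queue);
    await Promise.all(forkQueues.map(queue => queue.stop()));
    // ...then send CLOSE last, on the canonical queue.
    await this.queues.get(0)!.execute(async () => {
      this.log.push('CLOSE');
    });
  }
}
```

In this sketch, calls already enqueued on forks 1 and 2 finish before `CLOSE` is recorded, and any `call()` issued after `close()` rejects with an error whose message matches `/closed/i`.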

Drive-by: `e2e_expiration_timestamp` flake fix
`yarn-project/end-to-end/src/e2e_expiration_timestamp.test.ts`
Same workaround as PR #23336 used for `e2e_amm`: anchor the PXE to the checkpointed tip via `{ syncChainTip: 'checkpointed' }`. Without it the test races with the pipelined-prune that fires after the time warp and `proveWithKernels` blows up with `Block hash … not found`; this surfaced on the first run of this PR that got past lint.

Tests
- `prover-node.test.ts`: new `awaits in-flight epoch jobs before stop resolves` blocks `job.run()` via `promiseWithResolvers`, calls `stop()`, asserts it does NOT resolve until `run` resolves.
- `native_world_state.test.ts`: new `rejects calls issued after close with a JS error rather than crashing` closes a service and asserts a follow-up `fork()` rejects with `/closed/i`.

Out of scope
`cancelJobsOnStop: true` for in-process simulated prover-nodes).

Details: https://gist.github.com/AztecBot/e214caa9c9fdfe3b8de5fb9f9b2bf867
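
For readers who want the Fix #1 idea in isolation, the following is a rough sketch of tracking `runJob` wrappers in a map keyed by job id and awaiting them in `stop()`, together with the blocked-`run()` test shape described under Tests. Everything here is an assumption made for illustration: `MiniProverNode`, the `Job` interface, `deferred` (a local substitute for `promiseWithResolvers`), and the `events` array are hypothetical and do not mirror the real `ProverNode`.

```typescript
// Hypothetical minimal job shape; the real EpochProvingJob has a richer API.
interface Job {
  getId(): string;
  run(): Promise<void>;
  stop(): Promise<void>;
}

class MiniProverNode {
  private jobs = new Map<string, Job>();
  // Keyed by job id (a primitive) rather than storing promises in a Set.
  private runJobs = new Map<string, Promise<void>>();
  public events: string[] = [];

  startProof(job: Job): void {
    this.jobs.set(job.getId(), job);
    this.trackRunJob(job); // tracked, instead of `void this.runJob(job)`
  }

  private trackRunJob(job: Job): void {
    const id = job.getId();
    // runJob never rejects here, so a plain then() is enough for cleanup.
    const wrapper = this.runJob(job).then(() => {
      this.runJobs.delete(id);
      this.jobs.delete(id);
    });
    this.runJobs.set(id, wrapper);
  }

  private async runJob(job: Job): Promise<void> {
    try {
      await job.run();
    } catch {
      this.events.push(`failed:${job.getId()}`);
    } finally {
      // Stand-in for post-run work that still touches shared resources.
      this.events.push(`post-run:${job.getId()}`);
    }
  }

  async stop(): Promise<void> {
    // Signal every job to stop, then wait for the wrappers' post-run work.
    await Promise.all(Array.from(this.jobs.values()).map(job => job.stop()));
    await Promise.all(Array.from(this.runJobs.values()));
    // Only now is it safe to stop shared resources (prover, publisher, ...).
    this.events.push('resources-stopped');
  }
}

// Local substitute for a promiseWithResolvers-style helper.
function deferred(): { promise: Promise<void>; resolve: () => void } {
  let resolve!: () => void;
  const promise = new Promise<void>(r => (resolve = r));
  return { promise, resolve };
}

// Blocks run(), calls stop(), and records whether stop() resolved early.
async function demo(): Promise<string[]> {
  const gate = deferred();
  const node = new MiniProverNode();
  const job: Job = { getId: () => 'epoch-1', run: () => gate.promise, stop: async () => {} };
  node.startProof(job);
  let stopped = false;
  const stopping = node.stop().then(() => {
    stopped = true;
  });
  await new Promise(resolve => setTimeout(resolve, 10));
  node.events.push(`stopped-early:${stopped}`);
  gate.resolve(); // unblock run(); stop() may now finish
  await stopping;
  return node.events;
}
```

In this sketch `stop()` stays pending while `run()` is blocked, and `resources-stopped` is recorded only after the wrapper's post-run step has completed.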